Corpora Preparation and Stopword List Generation for Arabic data in Social Network

Authors

  • Walaa Medhat
  • Ahmed Hassan Yousef
  • Hoda Korashy Mohamed
Abstract

This paper proposes a methodology for preparing Arabic-language corpora from online social networks (OSN) and review sites for the Sentiment Analysis (SA) task. It also proposes a methodology for generating a stopword list from the prepared corpora. The aim is to investigate the effect of removing stopwords on the SA task. The problem is that previously generated stopword lists covered Modern Standard Arabic (MSA), which is not the common language used in OSN. We generated a stopword list for the Egyptian dialect and a corpus-based list to be used with the OSN corpora. We compare the efficiency of text classification when using the generated lists, the previously generated MSA lists, and a combination of the Egyptian dialect list with the MSA list. Text classification was performed using Naïve Bayes and Decision Tree classifiers and two feature selection approaches, unigrams and bigrams. The experiments show that the general lists containing Egyptian dialect words give better performance than lists of MSA stopwords only.
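The abstract does not say how the corpus-based list is built, so the following is only a minimal sketch under one common assumption: words that occur in the largest number of documents are treated as stopword candidates. The function names (`corpus_stopwords`, `remove_stopwords`, `ngrams`), the whitespace tokenizer, the `top_k` cutoff, and the transliterated toy sentences are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization; real OSN text would need
    # normalization (diacritics, elongation, emoticons) before this step.
    return text.split()


def corpus_stopwords(documents, top_k=3):
    # Count in how many documents each word appears (document frequency).
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    # Rank by descending document frequency; break ties alphabetically
    # so the result is deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def remove_stopwords(text, stopwords):
    # Drop every token that appears in the stopword list.
    stop = set(stopwords)
    return [tok for tok in tokenize(text) if tok not in stop]


def ngrams(tokens, n):
    # Build word n-grams (n=1 for unigrams, n=2 for bigrams).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    # Toy, transliterated stand-ins for dialect sentences (illustrative only).
    docs = [
        "el film da gamed awy",
        "el akl da wehesh awy",
        "el makan da helw",
    ]
    stop_list = corpus_stopwords(docs, top_k=3)
    filtered = remove_stopwords(docs[0], stop_list)
    print(stop_list)
    print(ngrams(filtered, 1))
    print(ngrams(filtered, 2))
```

Under the same assumption, a combined list such as the Egyptian dialect list merged with an MSA list would simply be the set union of the two lists before calling `remove_stopwords`; the resulting unigram or bigram features could then be passed to any Naïve Bayes or Decision Tree implementation.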


Similar resources

Egyptian Dialect Stopword List Generation from Social Network Data

This paper proposes a methodology for generating a stopword list from online social network (OSN) corpora in Egyptian Dialect (ED). The aim of the paper is to investigate the effect of removing ED stopwords on the Sentiment Analysis (SA) task. The stopwords lists generated before were on Modern Standard Arabic (MSA) which is not the common language used in OSN. We have generated a stopword list...


Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...


A Stemming Procedure and Stopword List for General French Corpora

Due to the increasing use of network-based systems, there is a growing interest in access to and search mechanisms for text databases in languages other than English. To adapt searching systems to those foreign languages with characteristics similar to the English language, all we need to do for the most part is to establish a general stopword list and a stemming procedure. This article present...


Automatic Stopword Generation using Contextual Semantics for Sentiment Analysis of Twitter

In this paper we propose a semantic approach to automatically identify and remove stopwords from Twitter data. Unlike most existing approaches, which rely on outdated and context-insensitive stopword lists, our proposed approach considers the contextual semantics and sentiment of words in order to measure their discrimination power. Evaluation results on 6 Twitter datasets show that, removing o...


Ditch the Smileys: Customizing a Stopword List for Email-based Data

The study uses grounded theory approach to develop different categories of stopwords leading to the creation of a stopword list for email-based data. The finding of the study will contribute in better understanding of email as data and developing better algorithms which could automatically remove specific category of stopwords. Résumé : Cette étude se base sur la théorie à base empirique pour d...



Journal:
  • CoRR

Volume abs/1410.1135  Issue

Pages  -

Publication date 2014